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Abstract 

We introduce a new methodology for the determination of amino- 
acid substitution matrices for use in the alignment of proteins. The new 
methodology is based on a pre-existing set cover on the set of residues 
and on the undirected graph that describes residue exchangeability given 
the set cover. For fixed functional forms indicating how to obtain edge 
weights from the set cover and, after that, substitution-matrix elements 
from weighted distances on the graph, the resulting substitution matrix 
can be checked for performance against some known set of reference align- 
ments and for given gap costs. Finding the appropriate functional forms 
and gap costs can then be formulated as an optimization problem that 
seeks to maximize the performance of the substitution matrix on the refer- 
ence alignment set. We give computational results on the BAliBASE suite 
using a genetic algorithm for optimization. Our results indicate that it is 
possible to obtain substitution matrices whose performance is either com- 
parable to or surpasses that of several others, depending on the particular 
scenario under consideration. 

Keywords: Sequence alignment, Substitution matrix, Residue set cover. 

1 Introduction 

One of the most central problems of computational molecular biology is to 
align two sequences of residues, a residue being generically understood as a 
nucleotide or an amino acid, depending respectively on whether the sequences 
under consideration are nucleic acids or proteins. This problem lies at the 
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heart of several higher-level applications, such as heuristically searching se- 
quence bases ^JETJEGl or aligning a larger number of sequences concomitantly 
PS1 IMI l2"2l 1551 1391 1551 IT7| for the identification of special common substructures 
(the so-called motifs, cf. ^3 [7JI7JJEU1 ESI) that encode structural or functional 
similarities of the sequences [70IEH1IE1E] or yet the sequences' promoter regions 
in the case of nucleic acids [73], for example. 

Finding the best alignment between two sequences is based on maximizing 
a scoring function that quantifies the overall similarity between the sequences. 
Normally this similarity function has two main components. The first one is a 
symmetric matrix, known as the substitution matrix for the set of residues under 
consideration, which gives the contribution the function is to incur when two 
residues are aligned to each other. The second component represents the cost 
of aligning a residue in a sequence to a gap in the other, and gives the negative 
contribution to be incurred by the similarity function when this happens. There 
is no consensually accepted, general-purpose criterion for selecting a substitution 
matrix or a gap-cost function. Common criteria here include those that stem 
from structural or physicochemical characteristics of the residues (e.g., ["H31I251 
BUEUESIEI) an d those that somehow seek to reproduce well-known alignments 
as faithfully as possible (e.g., g71 C3I ESI EOl EH OH E3 E31 ESI El E21 EH E21 El 
E3 EDI ED ESI IE3) Useful surveys include [77J El [73| . 

We then see that, even though an optimal alignment between two sequences 
is algorithmically well understood and amenable to being computed efficiently, 
the inherent difficulty of selecting appropriate scoring parameters suggests that 
the problem is still challenging in a number of ways. This is especially true 
of the case of protein alignment, owing primarily to the fact that the set of 
residues is significantly larger than in the case of nucleic acids, and also to the 
existence of a multitude of criteria whereby amino acids can be structurally or 
functionally exchanged by one another. 

For a given structural or physicochemical property (or set of properties) of 
amino acids, this exchangeability may be expressed by a set cover of the set of 
all amino acids, that is, by a collection of subsets of that set that includes every 
amino acid in at least one subset. Each of these subsets represents the possibility 
of exchanging any of its amino acids by any other. Set covers in this context 
have been studied extensively E7JE2E3E3E3EniEiElEHl[HIlEIl[7J[ini[32 
and constitute our departing point in this paper. As we describe in Sectional we 
introduce a new methodology for discovering both an appropriate substitution 
matrix and gap-cost parameters that starts by considering an amino-acid set 
cover. It then builds a graph from the set cover and sets up an optimization 
problem whose solution is the desired substitution matrix and gap costs. 1 

The resulting optimization problem is defined on a set of target sequence 
pairs, preferably one that embodies as great a variety of situations as possible. 
The target pairs are assumed to have known alignments, so the optimal solution 
to the problem of finding parameters comprises the substitution matrix and the 

1 Our new methodology is ultimately related to the work of several other authors that have 
dealt with the issue of assessing the efficacy of a substitution matrix or its relation to possible 
groupings of amino acids. We refer the interested reader to |3S 79 46 24 371 . for example. 
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gap costs whose use in a predefined alignment algorithm yields alignments of 
the target pairs that in some sense come nearest the known alignments of the 
same pairs. Our optimization problem is set up as a problem of combinato- 
rial search, being therefore highly unstructured and devoid of any facilitating 
differentiability properties. Reasonable ways to approach its solution are then 
all heuristic in nature. In Section [JJ we present the results of extensive com- 
putational experiments that employ an evolutionary algorithm and targets the 
BAliBASE pairs of amino-acid sequences [711 13^. 

Notice, in the context of the methodology categorization we mentioned ear- 
lier in passing, that our new methodology is of a dual character: it both relies 
on structural and physicochcmical similarities among amino acids and depends 
on a given set of aligned sequences in order to arrive at a substitution matrix 
and gap costs. We return to this hybrid aspect of our methodology in Section^ 
where conclusions are given. 

2 The methodology 

We describe our methodology for sequences on a generic set R of residues and 
only specialize it to the case of proteins in Sectional Given two residue sequences 
X and Y of lengths x and y, respectively, a global alignment of X and Y can 
be expressed by the 2 x z matrix A having the property that its first line, when 
read from left to right, is X possibly augmented by interspersed gaps, the same 
holding for the second line and Y, so long as no column of A comprises gaps only. 
It follows that z > x,y. In the case of a local alignment, that is, an alignment 
of a subsequence of X and another of Y , this matrix representation remains 
essentially unchanged, provided of course that x and y are set to indicate the 
sizes of the two subsequences. 

For a given substitution matrix S and a pair (h, g) of gap costs, 2 the simi- 
larity score of alignment A, denoted by Fg' 9 (A), is given by 

z 

F^{A) = Y,fs' 9 W,j),A{2,j)), (1) 
j'=i 

where f s ' a (A(l,j),A(2,j)) gives the contribution of aligning A(l,j) to A(2,j) 
as either S(A(l,j), A(2,j)), if neither A(l,j) nor A(2,j) is a gap; or — (h + g), 
if either A(l,j) or A(2,j) is the first gap in a contiguous group of gaps; or yet 
—g, if either A(l,j) or A(2,j) is the kth gap in a contiguous group of gaps for 
k > 1. An optimal global alignment of X and Y is one that maximizes the 
similarity score of |QJ over all possible global alignments of the two sequences. 
An optimal local alignment of X and Y, in turn, is the optimal global alignment 
of the subsequences of X and Y for which the similarity score is maximum 
over all pairs of subsequences of the two sequences. The set of all optimal 

2 For k > 0, we assume the customary affine function p(k) = h + gk with h, g > to express 
the cost of aligning the fcth gap of a contiguous group of gaps in a line of A to a residue in 
the other line as p(k) — p(k — 1), assuming p(0) = 1641 . 
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alignments of X and Y may be exponentially large in x and y, but it does 
nonetheless admit a concise representation as a matrix or directed graph that 
can be computed efficiently by well-known dynamic programming techniques 
jHOlElEni^ni regardless of whether a global alignment of the two sequences is 
desired or a local one We refer to this representation as A* x y 

Our strategy for the determination of a suitable substitution matrix starts 
with a set cover C = {C\ , . . . , C c } of the residue set R, that is, C is such that C\ U 
• • -UCc = R. Next we define G to be an undirected graph of node set R having an 
edge between two nodes (residues) u and v if and only if at least one of C\ , . . . , C c 
contains both u and v. Graph G provides a natural association between how 
exchangeable a node is by another and the distance between them in the graph. 
Intuitively, the closer two nodes are to each other in G the more exchangeable 
they are and we expect an alignment of the two to contribute relatively more 
positively to the overall similarity score. Quantifying this intuition involves 
crucial decisions, so we approach the problem in two careful steps, each leaving 
considerable room for flexibility. The first step consists of turning G into a 
weighted graph, that is, assigning nonnegative weights to its edges, and then 
computing the weighted distance between all pairs of nodes. 3 The second step 
addresses the turning of these weighted distances into elements of a substitution 
matrix so that larger distances signify ever more restricted exchangeability. 

Let us begin with the first step. For (u, v) an edge of G, let w(u, v) denote its 
weight. We define the value of w(u, v) on the premise that, if the exchangeability 
of u and v comes from their concomitant membership in a large set of C, then it 
should eventually result in a smaller contribution to the overall similarity score 
than if they were members of a smaller set. In other words, the former situation 
bespeaks an intuitive "weakness" of the property that makes the two residues 
exchangeable. In broad terms, then, we should let w(u, v) be determined by the 
smallest of the sets of C to which both u and v belong, and should also let it be 
a nondecreasing function of the size of this smallest set. 

Let c~ be the size of the smallest set of C and c + the size of its largest set. 
Let c~ v be the size of the smallest set of C of which both u and v are members. 
We consider two functional forms according to which w(u,v) may depend on 
c~ v as a nondecreasing function. Both forms force w(u, v) to be constrained 
within the interval [w~,w + ] with w~ > 0. For A > 1, the first form is the 
convex function 



Having established weights for all the edges of G, let d UtV denote the weighted 
distance between nodes u and v. Clearly, d u _ u — and, if no path exists in 

3 Weight non-negativity is crucial here, since the presence of negative weights may render 
the weighted-distance problem ill-posed 1121 . 





while the second is the concave function 
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G between u and v (i.e., G is not connected and the two nodes belong to two 
different connected components), then d u , v = oo. 

Carrying out the second step, which is obtaining the elements of the sub- 
stitution matrix from the weighted distances on G, involves difficult choices as 
well. While, intuitively, it is clear that residues separated by larger weighted 
distances in G are to be less exchangeable for each other than residues that 
are closer to each other (in weighted terms) in G, the functional form that 
the transformation of weighted distances into substitution-matrix elements is to 
take is once again subject to somewhat arbitrary decisions. What we do is to 
set S(u,v) — if d UjV = oo, and to consider two candidate functional forms for 
the transformation in the case of finite distances. 

Let us initially set [S~ , S + ) as the interval within which each element of the 
substitution matrix S is to be constrained (we assume S~ > for consistency 
with the substitution-matrix element that goes with an infinite distance, whose 
value we have just set to 0). Let us also denote by d + the largest (finite) 
weighted distance occurring in G for the choice of weights at hand. We then 
consider two functional forms for expressing the dependency of S(u,v), as a 
nonincreasing function, upon a finite d u>v . For fj, > 1, we once again consider a 
convex function, 



In Figure^we provide examples of the candidate functional forms for w±(u, v), 
w 2 (u, v), Si(u, v), and S^u, v) as given by 101-10), respectively. Each functional 
form is illustrated for two A or /x values, as the case may be. 

Once we decide on one of the two functional forms |J2J or ©, and similarly 
on one of @ or JSJ , and also choose values for w~ , w + , A, S~ , S + , and /x, then 
the substitution matrix S as obtained from C is well-defined and, together with 
the gap-cost parameters h and <?, can be used to find the representation A* x y of 
the set of all optimal (global or local) alignments between the two sequences X 
and Y . The quality of our choices regarding functional forms and parameters, 
and hence the quality of the resulting 5, h, and g, can be assessed if a reference 
alignment, call it A XY , is available for the two sequences. When this is the 
case, we let p^' 9 (A x Yl A* x Y ) be the fraction of the columns of A r x Y that also 
appear in at least one of the alignments that are represented in A* x Y . 4 The 
substitution matrix S, and also h and g, are then taken to be as good for A r x Y 

as Ps' 9 (A x . Y ,A X Y ) is close to 1. 

4 This definition must be read with care. If a certain column of A r x Y refers to a certain 
occurrence of residue a in X and of residue fi in Y, then it counts towards Pg 9 {A r x Y , A* x Y ) 
only if the same two occurrences of a and j3 are aligned to each other in at least one of the 
alignments represented in A* x Y - The cases in which a residue in one of the two sequences is 




(4) 



and a concave one, 
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Figure 1: Illustrative plots for ©-(HI), respectively in (a)-(d). In (a) and (b), 
w" = 2, w + = 4, cr = 3, c+ = 7, and A = 2 (solid plot) or A = 9 (dashed 
plot). In (c) and (d), 5" = 2, S + = 4, d+ = 7, and (jt = 2 (solid plot) or fi = 9 
(dashed plot). 
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Table 1: Parameters and their domains. 


Parameter 


Description 


Domain 




b w 


Selects between (J5J) and (J3J 


{1,2} 




w" 


Least possible edge weight 


{0.5,0.55, 


...,1} 


w + 


Greatest possible edge weight 


{1,1.125,. 


-.,5} 


A 


Exponent for use in or J3J 


{1,1.125,. 


-.,5} 


b s 


Selects between (@J and (0 


{1,2} 




s- 


Least possible element of S 


{0.5,0.55, 


...,1} 


s+ 


Greatest possible element of S 


{1,1.25,.. 


.,25} 


/' 


Exponent for use in (0J or JSJ 


{1,1.125,. 


-.,5} 


h 


Initialization gap cost 


{2,2.5,... 


,30} 


9 


Extension gap cost 


{0.25, 0.375,..., 5} 



Thus, given a residue set cover C and a set A r of reference alignments (each 
alignment on a different pair of sequences over the same residue set R) , obtaining 
the best possible substitution matrix S and gap-cost parameters h and g can be 
formulated as the following optimization problem: find functional forms and pa- 
rameters that maximize some (for now unspecified) average of p s ' 9 {A r x Y , A* x Y ) 
over all pairs (X, Y) of sequences such that A r x Y G A r . In the next section, we 
make this definition precise when residues are amino acids and proceed to the 
description of computational results. 

3 Computational results 

Let b w be a two-valued variable indicating which of J2J) or © is to be taken 
as the functional form for the edge weights, and similarly let bs indicate which 
of (0} or (J5J is to give the functional form for the elements of S. These new 
parameters defined, we begin by establishing bounds on the domains from which 
each of the other eight parameters involved in the optimization problem may 
take values, and also make those domains discrete inside such bounds by taking 
equally spaced delimiters. For the purposes of our study in this section, this 
results in what is shown in Tabled 

The parameter domains shown in Table ^make up for over 3.7 trillion pos- 
sible combinations, yielding about 1.6 billion different substitution matrices. 5 
The set of all such combinations seems to be structured in no usable way, so 
finding the best combination with respect to some set of reference alignments as 

aligned to a gap in the other in A T X Y are entirely analogous. The required bookkeeping in 
any of the cases is simple to perform if one resorts to the matrix or directed graph that gives 
the structure of A* x Y ■ 

5 This is only a rough estimate, since there are combinations that yield the same substitution 
matrix. For example, setting A = 1 renders b w needless, the same holding for /i and bg. In a 
similar vein, setting w~ = w + = 1 renders both b w and A needless, and similarly for bs and 
li (together with b w , w~ , ui+, and A) when S~ = S + = 1. 
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discussed in Section[21must not depend on any technique of explicit enumeration 
but rather on some heuristic approach. 

The approach we use in this section is to employ an evolutionary algorithm 
for finding the best possible combination within reasonable time bounds. Each 
individual for this algorithm is a 10-tuple indicating one of the possible combina- 
tion of parameter values. Our evolutionary algorithm is a standard generational 
genetic algorithm |48| . It produces a sequence of 100-individual generations, 
the first of which is obtained by randomly choosing a value for each of the 
10 parameters in order to produce each of its individuals. Each of the subse- 
quent generations is obtained from the current generation by a combination of 
crossover and mutation operations, following an initial elitist step whereby the 
5 fittest individuals of the current generation are copied to the new one. While 
the new generation is not full, either a pair of individuals is selected from the 
current generation to undergo crossover (with probability 0.5) or one individual 
is selected to undergo a single-locus mutation (with probability 0.5). 6 The pair 
of individuals resulting from the crossover, or the single mutated individual, is 
added to the new generation, unless an individual that is being added is identi- 
cal to an individual that already exists in the population. When this happens, 
the duplicating individual is substituted for by a randomly generated individ- 
ual. Selection is performed in proportion to the individuals' linearly normalized 
fitnesses. 7 

The crux of this genetic algorithm is of course how to assess an individual's 
fitness, and this is where an extant set of reference alignments A r comes in. In 
our study we take A r to be the set of alignments present in the BAliBASE suite 
PJ. It contains 167 families of amino-acid sequences arranged into eight refer- 
ence sets. For each family of the first five reference sets two pieces of reference 
information are provided: a multiple alignment of all the sequences in the family 
and a demarcation of the relevant motifs given the multiple alignment. Families 
in the remaining three reference sets are not provided with motif demarcations, 
so we refrain from using them in our experiments, since the fitness function that 
we use relies on reference motifs as well. Note that, even though the BAliBASE 
suite is targeted at multiple sequence alignments (cf. [73 EH| f° r example ap- 
plications), each such alignment trivially implies a pair wise alignment for all 
sequence pairs in each family and also motif fragments for each pair. Our set 
A r then comprises every sequence pair from the BAliBASE suite for which a 
reference alignment exists with accompanying motif demarcation. 

The organization of the BAliBASE suite suggests a host of possibilities for 
evaluating the efficacy of a substitution matrix S and of gap-cost parameters h 

e Both the crossover point and the locus for mutation are chosen at random, essentially 
with the parameters' domains in mind, so that the probability that such a choice singles out a 
parameter whose domain has size a is proportional to log a. Mutating the parameter's value 
is achieved straightforwardly, while breaking the 10-tuples for crossover requires the further 
step of interpreting the parameter as a binary number. 

7 This means that, for 1 < k < 100, the kth fittest individual in the generation is selected 
with probability proportional to L — (L — l)(k — 1)/99, where L is chosen so that the expression 
yields a value L times larger for the fittest individual than it does for the least fit (for which 
it yields value 1). We use L = 10 throughout. 



8 



and g. For a pair of sequences (X, Y), whose reference alignment is A r x Y 6 A r , 
and recalling that A* x Y represents the set of all optimal alignments of X and 

Y given S, h, and g, we use four variants of the Pg' 9 {A r x Y , A X Y ) of Section |21 
as the bases of the fitness function to be used by the genetic algorithm. These 
are denoted by Pg\{A r XY ,A XY ) through p h g \{A r x Y , A* x Y ) and differ among 
themselves as to which of the columns of the reference alignment are checked to 
be present in at least one of the optimal alignments. We let them be as follows: 

• Ps'i(A x Y , A* x Y ) is based on all the columns of A r x Y ', 



h,q 



Pg?>{A x Y , A* x Y ) is based on all the columns of A r XY that contain no 
gaps; 

• p s ' 9 3 (A r x Y , A* x Y ) is based on all the columns of A r XY that lie within 
motifs; 

• p h g\{A r x Y , A* x Y ) is based on all the columns of A r XY that lie within 
motifs and contain no gaps. 

These defined, we first average each one of them over A r before combining 
them into a fitness function. The average that we take is computed in the 
indirectly weighted style of |79j , which aims at preventing any family with overly 
many pairs, or any pair on which S, h, and g are particularly effective, from 
influencing the average too strongly. The weighting takes place on an array 
having 10 lines, one for each of the nonoverlapping 0.1-wide intervals within 
[0, 1], and one column for each of the BAliBASE families. Initially each pair 
(X, Y) having a reference alignment A r x Y in A r is associated with the array 
cell whose column corresponds to its family and whose line is given by the 
interval within which the identity score of the reference alignment A r x Y falls. 
This score is the ratio of the number of columns of A x Y whose two amino 
acids are identical to the number of columns that have no gaps (when averaging 
pg%{A r x Y ,A X y) or p h g%{A r x Y , A* x Y ), only columns that lie within motifs are 
taken into account). 

For 1 < k < 4, we then let p s ' 9 k (A r ) be the following average of pg 9 k {A r x Y , A x Y ) 

over A r . First take the average of Pg' k (A x Y , A* x Y ) for each array cell over the 
sequence pairs (X, Y) that are associated with it (cells with no pairs are ignored). 
Then p s ' 9 k (A r ) is computed by first averaging those averages that correspond 
to the same line of the array and finally averaging the resulting numbers (note 
that lines whose cells were all ignored for having no sequence pairs associated 
with them do not participate in this final average). 

We are then in position to state the definition of our fitness function. We 
denote it by ip^' 9 (A r ) to emphasize its dependency on how well S, h, and g 
lead to alignments that are in good accord with the alignments of A r . It is 
given by the standard Euclidean norm of the four-dimensional vector whose fcth 
component is Pg' 9 k (A r ), that is, 



Vg' 9 (A r ) = \l p s i(An . (6) 



h,9l 



h,g/ 



2 
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Table 2: Substitution matrices used for comparison. 



Matrix 



Original reference Reference for h and g 



BC0030 

BENNER74 

BL0SUM62 

FENG 

GONNET 

MCLACH 

NWSGAPPEP 

PAM250 

RAO 

RUSSELL-RH 
VTML160 
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Clearly, < (p h s ,9 {A r ) < 2 always. 

The substitution matrices we have used for comparison are shown in Ta- 
ble |2j 8 where for each one we give its most common epithet, the reference to 
where it was originally described, and, when different from the former, the ref- 
erence to where the gap-cost parameters h and g we use with it are to be found 
for both global and local alignments. This table is supplemented by Table 
where for each matrix we show the value of <p l g 9 (A r ) for both the global- and the 
local- alignment case; numbers in bold typeface are the minimum and maximum 
of the corresponding column. Table Ogives the two set covers we have used: X 
is the set cover from [33], S the one from [65| . 

One first set of results is summarized in the plots of Figure El and also in 
Table El Each of the plots in the figure indicates the evolution of ip^' 9 (A r ) 
as the genetic algorithm is run for each of the four combinations of global or 
local alignments with the X or S set cover. At each generation, what is plotted 
is the greatest value of ip s ' 3 {A r ) for individuals of that generation, S being 
the substitution matrix that corresponds to each individual as explained in 
Section[2] We present each plot against two constant values (indicated as dashed 
lines) giving the corresponding minimum and maximum highlighted in Table [5] 
The best individual of the last generation of each run is shown as a column 
in Table [5] containing the corresponding parameter values. Each of Table [S]s 
columns therefore corresponds to a substitution matrix, the one output by the 
corresponding run of the genetic algorithm, with accompanying gap costs. 

The first notable feature of the four plots in Figure[2]is that, in all cases, the 
fittest individual of the initial generation is already well placed with respect to 
the substitution matrices of Table [21 even though this generation is the result 

8 The denomination NWSGAPPEP is taken from 1201 . whose GCG software package was orig- 
inally described by [141 . For global alignments, we use BL0SUM62 to refer to a version of the 
matrix that has nonnegative elements exclusively (this version is obtained by adding the ab- 
solute value of the least element of the original matrix to all other elements, provided at least 
one negative element exists). The same holds for local alignments, in this case for the matrices 
BENNER74, GONNET, and PAM250 as well. 
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Table 3: Values of (f s ' 9 (A r ) for the matrices of Table [5] under global or local 
alignments. 



s 


Global alignments 


Local alignments 


BC0030 


1.5226 


1.5060 


BENNER74 


1.5601 


1.5348 


BLDSUM62 


1.5532 


1.5542 


FENG 


1.5062 


1.4950 


GONNET 


1.5419 


1.5373 


MCLACH 


1.5415 


1.5371 


NWSGAPPEP 


1.5306 


1.5181 


PAM250 


1.5243 


1.5064 


RAO 


1.4864 


1.4912 


RUSSELL-RH 


1.4508 


1.4508 


VTML160 


1.5734 


1.4296 



Table 4: Set covers. 



Cover set 


2 


S 


Ci 


{M,I,L,V} 


{P} 


c 2 


{M,I,L,V,A,P} 


{AG} 


c. 


{M,I,L,V,F,W} 


{D,E} 


c 4 


{M,I,L,V,A,P,F,W} 


{N,Q} 


c 5 


{D, E, H, R, K] 


{S,T} 


Co 


{S,T,Q,N} 


{F,W,Y} 


c 7 


{S,T,Q,N,D,E} 


{H, K, R] 




{Q,N,D,E,H, R, K} 


{I,L,V} 


c 9 


{S,T,Q,N,D,E,H,R,K} 


{C,F,I,L,M,V,W,Y} 


Cio 


{Q,N} 


{D, E, H, K, N,Q,R,S, T] 


Cxi 


{D,E,Q,N} 




Cl2 


{H, R, K} 




Cl3 


{R,K} 




C14 


{F,W 7 Y} 




C15 


{G,N} 




Cl6 


{A,C,G,S} 




C17 


{S,T} 




C\8 


{D,E} 
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Figure 2: Evolution of the fitness, as given by ©, under global (a and b) or 
local (c and d) alignments for the I (a and c) or S (b and d) set covers. 



Table 5: Values of the parameters of Tabled at the end of each of the four runs 
depicted in Figure [21 Indications in parentheses refer to which of parts (a)-(d) 
of the figure the columns correspond. 
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of a random selection of parameter values for each of its individuals. This, 
alone, is in our opinion solid indication that the essential underlying premise 
of our new methodology — that the elements of a substitution matrix can be 
computed as a function of weighted distances on the undirected graph that 
represents a certain amino-acid set cover — is sound. From the initial generations 
onward, in all four cases some rapid progress is made initially, and then htness 
improvements become more and more sporadic. This is no surprise if we consider 
that the fitness landscape we are dealing with is completely non-differentiable 
and probably highly rugged (i.e., with many local maxima) as well, which is in 
fact the reason why we give mutations the high prominence of a 50% chance as 
a new generation is being filled. 

The question, of course, is whether running the genetic algorithm beyond 
the 270 generations of the figure can lead it to eventually find individuals whose 
fitnesses go beyond the uppermost dashed lines in the plots (that is, individ- 
uals that surpass the best-performing matrices on the reference alignments in 
A r ). Seemingly, this would require some sort of phase-transition behavior fol- 
lowing the slow progress that the plots depict past the first 50 generations or so. 
While such a behavior is known to occur relatively often when handling hard, 
unstructured optimization problems (cf., e.g., [2] f° r a recent example from 
combinatorial optimization), in our case carrying over with the algorithm for 
each single generation has required roughly 13 to 14 hours, 9 so at first seeking 
significant further improvement does seem impractical. 

Notice, however, that practically all of this time consumption is related to 
computing ip'g 9 (A r ) for each individual in the current population. Because this 
is done in a manner that is fully independent from any other individual, we 
can speed the overall computation up nearly optimally by simply bringing more 
processors into the effort. 10 

Our second set of results carries the genetic algorithm well beyond the 270 
generations of Figure |2 To this end we employed the parallel strategy outlined 
above on four processors, and also concentrated solely on evolving individuals 
under global alignments for the S set cover. We did, in addition, consider only 
a subset of A r ', denoted by A r,1 1 comprising sequence pairs that are relative 
to the BAliBASE reference set 1. In this case, the fitness function to be max- 
imized is ip h s ,& l (A r ' 1 ), defined as in Jfjjl when A 7 "' 1 substitutes for A r . Given 
these simplifications, computing through each generation has taken roughly 20 
minutes. 

The values of Lp h s ' 9 {A r ' 1 ) for the substitution matrices of Table El are given in 
Table for global alignments only. Notice that this table also contains values 
for the individual fitness components p s ' ^(yT' 1 ), . . . , Ps%A r ' 1 ) for each matrix; 

9 These data refer to an Intel Pentium 4 processor running at 2.26 GHz. 
10 A finer-grained opportunity for fully independent parallelism can also be identified 
if we recognize that computing ip h g' a {A. r ) in essence boils down to computing each of 
Ps'l^X y y) through pg' g ^{A r x yi^x y)> independently, for every pair (X,Y) having 
a reference alignment in A r ■ Harnessing this form of parallelism is infeasible, though, given 
the current technological reality. 
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Table 6: Values of ip h s ,& '{A 1 "' 1 ) and of Pg' 9 1 (A r ' 1 ), . . . , p s ' \{A r ' 1 ) for the matrices 
of Table |2 under global alignments. 
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Figure 3: Evolution of the fitness, as given by JHJ on A r ' 1 , under global align- 
ments for the S set cover. 

these will be used shortly. In Table as in Table a bold typeface is used to 
indicate extremal values within each of the five numeric columns. 

FigureOland Tabled summarize the results of this smaller-scale experiment. 
The plot in Figure|21is analogous to each of the plots in Figurc|21and, like them, 
is given against the dashed lines that indicate the values highlighted in the 
leftmost numeric column of Table [H] It is presented as two juxtaposed plots on 
the initial and final 150 generations simply for the sake of emphasizing the rapid 
fitness growth during the first few tens of generations, on the one hand, and the 
very slow growth thereafter, on the other (during the generations that the plot 
skips there is growth in one single generation only) . Table [7| is analogous to 
Table [SJ indicating the parameter values that characterize the fittest individual 
at the end of the run of the genetic algorithm. 

What is interesting in this second set of results is that, even though nothing 
resembling the phase-transition-like behavior alluded to above has taken place, 
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Table 7: Values of the parameters of Tabled at the end of the run depicted in 
Figure 01 

Parameter Final value 
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w~ 


0.95 


w + 


3.625 
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1.875 


bs 
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s- 
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S+ 


5.5 


I'- 


1.125 


ll 


7.5 


a 


1.5 



the fitness of the substitution matrix and gap costs that arise from the param- 
eter values of Table[7| specifically 1.5797, is now very near 1.5865, which is the 
highest value appearing in the leftmost numeric column of Table|Sl In addition, 
let us consider the greatest values of each of Ps' 9 1 {A r ' 1 ), . . . , Ps'liA 7 "' 1 ) f° r eacn 
generation. Plotting these values against the corresponding minima and max- 
ima highlighted in the rightmost four columns of Tabic yields what is shown 
in Figure 0] which clearly indicates that the genetic algorithm very quickly 
produces a substitution matrix, with associated gap costs, that surpasses the 
champion of Table HO as far as the fitness components Ps%(A r ' 1 ) and p l g\(A r ' 1 ) 

are concerned, even though it lags behind in terms of p h s ' 9 1 (A r ' 1 ) and p s ' ?,(A r:1 ). 
This substitution matrix, it turns out, is then superior to all the matrices of 
Table [21 when it comes to stressing alignment columns that lie within motifs. 

4 Concluding remarks 

We have introduced a new methodology for the determination of amino- acid sub- 
stitution matrices. The new methodology starts with a set cover of the residue 
alphabet under consideration and builds an undirected graph in which node 
vicinity is taken to represent residue exchangeability. The desired substitution 
matrix arises as a function of weighted distances in this graph. Determining 
the edge weights, and also how to convert the resulting weighted distances into 
substitution-matrix elements, constitute the outcome of an optimization process 
that runs on a set of reference sequence alignments and also outputs gap costs for 
use with the substitution matrix. Our methodology is then of a hybrid nature: 
it relies both on the structural and physicochemical properties that underlie the 
set cover in use and on an extant set of reference sequence alignments. 

The optimization problem to be solved is well-defined: given parameterized 
functional forms for turning cover sets into edge weights and weighted distances 
into substitution-matrix elements, the problem asks for parameter values and 
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Figure 4: Evolution of each of the fitness components p h s ' 9 1 {A r ' 1 ), . . . , p s ' K^*"' 1 ), 
shown respectively in (a) through (d), under global alignments for the S set 
cover. 
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gap costs that maximize a certain objective function on the reference set of 
alignments. We have reported on computational experiments that use a genetic 
algorithm as optimization method and the BAliBASE suite as the source of the 
required reference alignments. Our results are supportive of the following main 
conclusions. First, that the overall methodology is capable of producing substi- 
tution matrices whose performance falls within the same range of a number of 
known matrices' even before any optimization is actually performed (i.e., based 
on the random parameter instantiation that precedes the genetic algorithm); 
this alone, we believe, singles out our methodology as a principled way of de- 
termining substitution matrices that concentrates all the effort related to the 
structure and physicochemical properties of amino acids on the discovery of an 
appropriate set cover. Secondly, that there are scenarios for which the method- 
ology we introduce can already be claimed to yield a substitution matrix that 
surpasses all the others against which it was tested. 

We have also found that strengthening this latter conclusion so that it holds 
in a wider variety of scenarios depends on how efficiently we can run the genetic 
algorithm. Fortunately, it appears that it is all a matter of how many processors 
can be amassed for the effort, since the genetic procedure is inherently amenable 
to parallel processing and highly scalable, too. There is, of course, also the 
issue of investigating alternative functional forms and parameter ranges to set 
up the optimization problem, and in fact the issue of considering other objective 
functions as well. Together with the search for faster optimization, these issues 
make for a very rich array of possibilities for further study. 
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